Differential dynamic programming

Differential dynamic programming (DDP) is an optimal control algorithm of the trajectory optimization class. The algorithm was introduced in 1966 by Mayne[1] and subsequently analysed in Jacobson and Mayne's eponymous book.[2] The algorithm uses locally-quadratic models of the dynamics and cost functions, and displays quadratic convergence. It is closely related to Pantoja's step-wise Newton's method.[3][4]

Finite-horizon discrete-time problems

The dynamics

$$ \mathbf{x}_{i+1} = \mathbf{f}(\mathbf{x}_i, \mathbf{u}_i) \tag{1} $$

describe the evolution of the state $\mathbf{x}$ given the control $\mathbf{u}$ from time $i$ to time $i+1$. The total cost $J_0$ is the sum of running costs $\ell$ and final cost $\ell_f$, incurred when starting from state $\mathbf{x}$ and applying the control sequence $\mathbf{U} \equiv \{\mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_{N-1}\}$ until the horizon is reached:

$$ J_0(\mathbf{x}, \mathbf{U}) = \sum_{i=0}^{N-1} \ell(\mathbf{x}_i, \mathbf{u}_i) + \ell_f(\mathbf{x}_N), $$

where $\mathbf{x}_0 \equiv \mathbf{x}$, and the $\mathbf{x}_i$ for $i > 0$ are given by Eq. 1. The solution of the optimal control problem is the minimizing control sequence $\mathbf{U}^*(\mathbf{x}) \equiv \operatorname{argmin}_{\mathbf{U}} J_0(\mathbf{x}, \mathbf{U})$. Trajectory optimization means finding $\mathbf{U}^*(\mathbf{x})$ for a particular $\mathbf{x}_0$, rather than for all possible initial states.
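For concreteness, the following sketch (Python with NumPy) sets up such a finite-horizon problem and evaluates the total cost $J_0$ by rolling Eq. 1 forward. The double-integrator dynamics, the quadratic costs and all numerical values are illustrative assumptions made only for this sketch, not part of the algorithm.

```python
# Illustrative finite-horizon problem (Eq. 1 and the total cost J_0).
# Dynamics, costs and constants are assumed toy values: a discrete-time
# double integrator with quadratic running and final costs.
import numpy as np

dt, N = 0.1, 50                      # step size and horizon (assumed)

def f(x, u):
    """x_{i+1} = f(x_i, u_i): double-integrator dynamics."""
    pos, vel = x
    return np.array([pos + dt * vel, vel + dt * u[0]])

def running_cost(x, u):
    """l(x, u): quadratic running cost."""
    return 0.5 * (x @ x) + 0.05 * (u @ u)

def final_cost(x):
    """l_f(x_N): quadratic final cost."""
    return 10.0 * (x @ x)

def total_cost(x0, U):
    """J_0(x, U): roll the controls forward through Eq. 1 and sum the costs."""
    x, J = x0, 0.0
    for u in U:
        J += running_cost(x, u)
        x = f(x, u)
    return J + final_cost(x)

U = np.zeros((N, 1))                 # a candidate control sequence
print(total_cost(np.array([1.0, 0.0]), U))
```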

Dynamic programming

Let $\mathbf{U}_i$ be the partial control sequence $\mathbf{U}_i \equiv \{\mathbf{u}_i, \mathbf{u}_{i+1}, \dots, \mathbf{u}_{N-1}\}$ and define the cost-to-go $J_i$ as the partial sum of costs from $i$ to $N$:

$$ J_i(\mathbf{x}, \mathbf{U}_i) = \sum_{j=i}^{N-1} \ell(\mathbf{x}_j, \mathbf{u}_j) + \ell_f(\mathbf{x}_N). $$

The optimal cost-to-go or value function at time $i$ is the cost-to-go given the minimizing control sequence:

$$ V(\mathbf{x}, i) \equiv \min_{\mathbf{U}_i} J_i(\mathbf{x}, \mathbf{U}_i). $$

Setting $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$, the dynamic programming principle reduces the minimization over an entire sequence of controls to a sequence of minimizations over a single control, proceeding backwards in time:

$$ V(\mathbf{x}, i) = \min_{\mathbf{u}} \left[ \ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1) \right]. \tag{2} $$

This is the Bellman equation.
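A brute-force illustration of Eq. 2 on a coarse grid shows how the minimization over a whole control sequence collapses into $N$ single-control minimizations performed backwards in time. The one-dimensional toy problem, the grid sizes and the interpolation are assumptions made only for this sketch.

```python
# Tabular illustration of the Bellman recursion (Eq. 2) on a coarse grid.
import numpy as np

xs = np.linspace(-2, 2, 21)           # 1-D state grid (assumed toy problem)
us = np.linspace(-1, 1, 11)           # control grid

def f1(x, u):  return np.clip(x + 0.1 * u, xs[0], xs[-1])   # dynamics
def l1(x, u):  return 0.5 * x**2 + 0.05 * u**2               # running cost
def lf(x):     return 5.0 * x**2                              # final cost

N = 20
V = lf(xs)                            # V(x, N) = l_f(x_N)
for i in reversed(range(N)):          # proceed backwards in time
    V_next = V.copy()
    for j, x in enumerate(xs):
        # single-control minimization of l(x,u) + V(f(x,u), i+1)
        V[j] = min(l1(x, u) + np.interp(f1(x, u), xs, V_next) for u in us)
```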

Differential dynamic programming

DDP proceeds by iteratively performing a backward pass on the nominal trajectory to generate a new control sequence, and then a forward pass to compute and evaluate a new nominal trajectory. We begin with the backward pass. If

$$ \ell(\mathbf{x}, \mathbf{u}) + V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1) $$

is the argument of the $\min[\cdot]$ operator in Eq. 2, let $Q$ be the variation of this quantity around the $i$-th $(\mathbf{x}, \mathbf{u})$ pair:

$$ \begin{aligned} Q(\delta\mathbf{x}, \delta\mathbf{u}) \equiv {} & \ell(\mathbf{x}+\delta\mathbf{x}, \mathbf{u}+\delta\mathbf{u}) + V(\mathbf{f}(\mathbf{x}+\delta\mathbf{x}, \mathbf{u}+\delta\mathbf{u}), i+1) \\ & - \ell(\mathbf{x}, \mathbf{u}) - V(\mathbf{f}(\mathbf{x}, \mathbf{u}), i+1) \end{aligned} $$

and expand to second order

$$ \approx \frac{1}{2} \begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix}^{\mathsf{T}} \begin{bmatrix} 0 & Q_{\mathbf{x}}^{\mathsf{T}} & Q_{\mathbf{u}}^{\mathsf{T}} \\ Q_{\mathbf{x}} & Q_{\mathbf{x}\mathbf{x}} & Q_{\mathbf{x}\mathbf{u}} \\ Q_{\mathbf{u}} & Q_{\mathbf{u}\mathbf{x}} & Q_{\mathbf{u}\mathbf{u}} \end{bmatrix} \begin{bmatrix} 1 \\ \delta\mathbf{x} \\ \delta\mathbf{u} \end{bmatrix} \tag{3} $$

The $Q$ notation used here is a variant of the notation of Morimoto, where subscripts denote differentiation in denominator layout.[5] Dropping the index $i$ for readability, with primes denoting the next time-step, $V' \equiv V(i+1)$, the expansion coefficients are

$$ \begin{aligned} Q_{\mathbf{x}} &= \ell_{\mathbf{x}} + \mathbf{f}_{\mathbf{x}}^{\mathsf{T}} V'_{\mathbf{x}} \\ Q_{\mathbf{u}} &= \ell_{\mathbf{u}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}} \\ Q_{\mathbf{x}\mathbf{x}} &= \ell_{\mathbf{x}\mathbf{x}} + \mathbf{f}_{\mathbf{x}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{x}} + V'_{\mathbf{x}} \cdot \mathbf{f}_{\mathbf{x}\mathbf{x}} \\ Q_{\mathbf{u}\mathbf{u}} &= \ell_{\mathbf{u}\mathbf{u}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{u}} + V'_{\mathbf{x}} \cdot \mathbf{f}_{\mathbf{u}\mathbf{u}} \\ Q_{\mathbf{u}\mathbf{x}} &= \ell_{\mathbf{u}\mathbf{x}} + \mathbf{f}_{\mathbf{u}}^{\mathsf{T}} V'_{\mathbf{x}\mathbf{x}} \mathbf{f}_{\mathbf{x}} + V'_{\mathbf{x}} \cdot \mathbf{f}_{\mathbf{u}\mathbf{x}}. \end{aligned} $$

The last terms in the last three equations denote contraction of a vector with a tensor. Minimizing the quadratic approximation (3) with respect to $\delta\mathbf{u}$, we have

$$ \delta\mathbf{u}^* = \operatorname{argmin}_{\delta\mathbf{u}} Q(\delta\mathbf{x}, \delta\mathbf{u}) = -Q_{\mathbf{u}\mathbf{u}}^{-1}\left(Q_{\mathbf{u}} + Q_{\mathbf{u}\mathbf{x}}\,\delta\mathbf{x}\right), \tag{4} $$

giving an open-loop term $\mathbf{k} = -Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}}$ and a feedback gain term $\mathbf{K} = -Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}\mathbf{x}}$. Plugging the result back into (3), we now have a quadratic model of the value at time $i$:

$$ \begin{aligned} \Delta V(i) &= -\tfrac{1}{2} Q_{\mathbf{u}}^{\mathsf{T}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}} \\ V_{\mathbf{x}}(i) &= Q_{\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}} \\ V_{\mathbf{x}\mathbf{x}}(i) &= Q_{\mathbf{x}\mathbf{x}} - Q_{\mathbf{x}\mathbf{u}} Q_{\mathbf{u}\mathbf{u}}^{-1} Q_{\mathbf{u}\mathbf{x}}. \end{aligned} $$

Recursively computing the local quadratic models of $V(i)$ and the control modifications $\{\mathbf{k}(i), \mathbf{K}(i)\}$, from $i = N-1$ down to $i = 1$, constitutes the backward pass. As above, the value is initialized with $V(\mathbf{x}, N) \equiv \ell_f(\mathbf{x}_N)$. Once the backward pass is completed, a forward pass computes a new trajectory:

$$ \begin{aligned} \hat{\mathbf{x}}(1) &= \mathbf{x}(1) \\ \hat{\mathbf{u}}(i) &= \mathbf{u}(i) + \mathbf{k}(i) + \mathbf{K}(i)\left(\hat{\mathbf{x}}(i) - \mathbf{x}(i)\right) \\ \hat{\mathbf{x}}(i+1) &= \mathbf{f}(\hat{\mathbf{x}}(i), \hat{\mathbf{u}}(i)). \end{aligned} $$

The backward passes and forward passes are iterated until convergence. If the Hessians $Q_{\mathbf{x}\mathbf{x}}, Q_{\mathbf{u}\mathbf{u}}, Q_{\mathbf{u}\mathbf{x}}, Q_{\mathbf{x}\mathbf{u}}$ are replaced by their Gauss-Newton approximation, the method reduces to the iterative Linear Quadratic Regulator (iLQR).[6]
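The following self-contained sketch puts the two passes together for an illustrative linear-quadratic toy problem (a double integrator). Because the second-order dynamics terms $\mathbf{f}_{\mathbf{x}\mathbf{x}}, \mathbf{f}_{\mathbf{u}\mathbf{u}}, \mathbf{f}_{\mathbf{u}\mathbf{x}}$ are dropped, it is the Gauss-Newton (iLQR) variant mentioned above; all names and numerical values are assumptions of the sketch rather than a reference implementation.

```python
# One iteration of the backward pass and forward pass for an assumed
# linear-quadratic toy problem (double integrator, quadratic costs).
import numpy as np

dt, N = 0.1, 50
A = np.array([[1.0, dt], [0.0, 1.0]])      # f_x for the linear toy dynamics
B = np.array([[0.0], [dt]])                # f_u
Qc, Rc, Qf = np.eye(2), 0.1 * np.eye(1), 10.0 * np.eye(2)   # cost weights

f  = lambda x, u: A @ x + B @ u            # Eq. 1
l  = lambda x, u: 0.5 * (x @ Qc @ x + u @ Rc @ u)
lf = lambda x: 0.5 * x @ Qf @ x

def backward_pass(xs, us):
    """Compute k(i), K(i) and the local value model from i = N-1 down to 0."""
    k, K = np.zeros((N, 1)), np.zeros((N, 1, 2))
    V_x, V_xx = Qf @ xs[N], Qf             # initialize with V(x, N) = l_f(x_N)
    for i in reversed(range(N)):
        x, u = xs[i], us[i]
        # expansion coefficients of Q (second-order dynamics terms omitted)
        Q_x  = Qc @ x + A.T @ V_x
        Q_u  = Rc @ u + B.T @ V_x
        Q_xx = Qc + A.T @ V_xx @ A
        Q_uu = Rc + B.T @ V_xx @ B
        Q_ux = B.T @ V_xx @ A
        # open-loop and feedback terms, Eq. 4
        k[i] = -np.linalg.solve(Q_uu, Q_u)
        K[i] = -np.linalg.solve(Q_uu, Q_ux)
        # quadratic model of the value at time i
        V_x  = Q_x  + Q_ux.T @ k[i]        # = Q_x  - Q_xu Q_uu^{-1} Q_u
        V_xx = Q_xx + Q_ux.T @ K[i]        # = Q_xx - Q_xu Q_uu^{-1} Q_ux
    return k, K

def forward_pass(xs, us, k, K, alpha=1.0):
    """Roll out the modified controls to obtain a new nominal trajectory."""
    xs_new, us_new = [xs[0]], []
    for i in range(N):
        u = us[i] + alpha * k[i] + K[i] @ (xs_new[i] - xs[i])
        us_new.append(u)
        xs_new.append(f(xs_new[i], u))
    return np.array(xs_new), np.array(us_new)

# iterate backward and forward passes from a zero-control nominal trajectory
us = np.zeros((N, 1))
xs = np.array([[1.0, 0.0]])
for u in us:
    xs = np.vstack([xs, f(xs[-1], u)])
for _ in range(10):
    k, K = backward_pass(xs, us)
    xs, us = forward_pass(xs, us, k, K)
```

For this linear-quadratic example a single iteration already reaches the optimum; for nonlinear dynamics and costs the two passes are repeated until the total cost stops decreasing.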

Differential dynamic programming is a second-order algorithm like Newton's method. It therefore takes large steps toward the minimum and often requires regularization and/or line-search to achieve convergence.[7][8] Regularization in the DDP context means ensuring that the $Q_{\mathbf{u}\mathbf{u}}$ matrix in Eq. 4 is positive definite. Line-search in DDP amounts to scaling the open-loop control modification $\mathbf{k}$ by some $0 < \alpha < 1$.
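A minimal sketch of these two devices might look as follows. The regularization schedule, the candidate step sizes, and the `cost_of_alpha` callable (for example, a wrapper that runs the forward pass with a given $\alpha$ and returns the total cost) are illustrative assumptions, not the article's prescription.

```python
# Hedged sketch: Tikhonov-style regularization of Q_uu and a simple backtracking
# line search on the step size alpha.
import numpy as np

def regularize(Q_uu, mu=1e-6, factor=10.0, max_tries=20):
    """Increase mu until Q_uu + mu*I is positive definite (Cholesky succeeds)."""
    for _ in range(max_tries):
        try:
            np.linalg.cholesky(Q_uu + mu * np.eye(Q_uu.shape[0]))
            return Q_uu + mu * np.eye(Q_uu.shape[0])
        except np.linalg.LinAlgError:
            mu *= factor
    raise RuntimeError("could not make Q_uu positive definite")

def line_search(cost_of_alpha, alphas=(1.0, 0.5, 0.25, 0.1, 0.05)):
    """Try decreasing alpha (scaling the open-loop term k) until the cost decreases."""
    baseline = cost_of_alpha(0.0)      # alpha = 0 reproduces the nominal trajectory
    for alpha in alphas:
        if cost_of_alpha(alpha) < baseline:
            return alpha
    return 0.0                          # no improving step found; keep the nominal
```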

Monte Carlo version

Sampled differential dynamic programming (SaDDP) is a Monte Carlo variant of differential dynamic programming.[9][10][11] It is based on treating the quadratic cost of differential dynamic programming as the energy of a Boltzmann distribution. This way the quantities of DDP can be matched to the statistics of a multidimensional normal distribution. The statistics can be recomputed from sampled trajectories without differentiation.

Sampled differential dynamic programming has been extended to Path Integral Policy Improvement with Differential Dynamic Programming.[12] This creates a link between differential dynamic programming and path integral control,[13] which is a framework of stochastic optimal control.

Constrained problems

Interior Point Differential dynamic programming (IPDDP) is an interior-point method generalization of DDP that can address the optimal control problem with nonlinear state and input constraints.[14]

References

  1. Mayne, D. Q. (1966). "A second-order gradient method of optimizing non-linear discrete time systems". Int J Control 3: 85–95. doi:10.1080/00207176608921369. 
  2. Mayne, David Q.; Jacobson, David H. (1970). Differential dynamic programming. New York: American Elsevier Pub. Co. ISBN 978-0-444-00070-5. https://books.google.com/books?id=tA-oAAAAIAAJ. 
  3. de O. Pantoja, J. F. A. (1988). "Differential dynamic programming and Newton's method". International Journal of Control 47 (5): 1539–1553. doi:10.1080/00207178808906114. ISSN 0020-7179. 
  4. Liao, L. Z. (1992). Advantages of differential dynamic programming over Newton's method for discrete-time optimal control problems. https://hdl.handle.net/1813/5474. 
  5. Morimoto, J.; G. Zeglin; C.G. Atkeson (2003). "Minimax differential dynamic programming: Application to a biped walking robot". 2. pp. 1927–1932. 
  6. Baumgärtner, K. (2023). "A Unified Local Convergence Analysis of Differential Dynamic Programming, Direct Single Shooting, and Direct Multiple Shooting". 2023 European Control Conference (ECC). pp. 1–7. doi:10.23919/ECC57647.2023.10178367. https://ieeexplore.ieee.org/document/10178367. 
  7. Liao, L. Z; C. A Shoemaker (1991). "Convergence in unconstrained discrete-time differential dynamic programming". IEEE Transactions on Automatic Control 36 (6): 692. doi:10.1109/9.86943. 
  8. Tassa, Y. (2011). Theory and implementation of bio-mimetic motor controllers (PDF) (Thesis). Hebrew University. Archived from the original (PDF) on 2016-03-04. Retrieved 2012-02-27.
  9. "Sampled differential dynamic programming" (in en-US). doi:10.1109/IROS.2016.7759229. 
  10. Rajamäki, Joose; Hämäläinen, Perttu (June 2018). "Regularizing Sampled Differential Dynamic Programming" (in en-US). 2018 Annual American Control Conference (ACC). pp. 2182–2189. doi:10.23919/ACC.2018.8430799. https://ieeexplore.ieee.org/document/8430799. Retrieved 2018-10-19. 
  11. Rajamäki, Joose (2018) (in en). Random Search Algorithms for Optimal Control. Aalto University. ISBN 978-952-60-8156-4. http://urn.fi/URN:ISBN:978-952-60-8156-4. 
  12. Lefebvre, Tom; Crevecoeur, Guillaume (July 2019). "Path Integral Policy Improvement with Differential Dynamic Programming". 2019 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM). pp. 739–745. doi:10.1109/AIM.2019.8868359. ISBN 978-1-7281-2493-3. https://ieeexplore.ieee.org/document/8868359. 
  13. Theodorou, Evangelos; Buchli, Jonas; Schaal, Stefan (May 2010). "Reinforcement learning of motor skills in high dimensions: A path integral approach". 2010 IEEE International Conference on Robotics and Automation. pp. 2397–2403. doi:10.1109/ROBOT.2010.5509336. ISBN 978-1-4244-5038-1. 
  14. Pavlov, Andrei; Shames, Iman; Manzie, Chris (2020). "Interior Point Differential Dynamic Programming". IEEE Transactions on Control Systems Technology 29 (6): 2720. doi:10.1109/TCST.2021.3049416. Bibcode: 2021ITCST..29.2720P. 
  • The open-source software framework acados provides an efficient and embeddable implementation of DDP.